Vivian Xia

MSDS453 - Research Assignment 01 - First Vectorized Representation

Importing Packages

NLTK Downloads

Mount Google Drive to Colab Environment

Functions

Top term

Method 1 Experiments

Data wrangling

baseline = tokenization + normalization (remove punctuation, drop tokens of 3 characters or fewer, convert to lowercase)
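
A minimal sketch of the baseline steps listed above, using only the standard library. The function name `baseline_tokens`, the whitespace tokenizer, and the sample sentence are assumptions for illustration; the notebook's own pre-processing functions may differ (for example by using an NLTK tokenizer).

```python
import string

def baseline_tokens(text, min_len=4):
    """Lowercase, strip punctuation, split on whitespace, drop short tokens."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    # Keep only tokens longer than 3 characters.
    return [tok for tok in tokens if len(tok) >= min_len]

print(baseline_tokens("The plot, oddly, was GREAT fun!"))
# ['plot', 'oddly', 'great']
```

Dropping tokens of 3 characters or fewer also removes many common function words ("the", "was", "and"), so this baseline filters some stop words implicitly.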

Source Class Corpus

Functions for pre-processing

Pre-process data set

Document Vectorization using tf-idf

Calculates TFIDF and Saves TFIDF Values for Terms
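
To make the tf-idf step concrete, here is a pure-Python sketch. It uses the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which is the default behaviour of scikit-learn's `TfidfVectorizer`; whether the notebook relies on that class is an assumption. The toy documents and the mean-weight ranking are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocab, rows) with L2-normalized rows."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Toy corpus (hypothetical, already tokenized).
docs = [["robot", "alien", "robot"], ["alien", "ship"], ["romance", "ship"]]
vocab, matrix = tfidf(docs)

# One plausible way to save a single value per term: the mean weight across documents.
mean_weight = {t: sum(r[i] for r in matrix) / len(matrix) for i, t in enumerate(vocab)}
top_terms = sorted(mean_weight, key=mean_weight.get, reverse=True)
print(top_terms)
```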

Terms by Embedding Matrix

100 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity
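
The quantity behind each heatmap is a pairwise cosine similarity matrix. The sketch below computes it in plain Python; the three toy vectors are made up for illustration and are not word2vec output. Plotting the resulting matrix (for example with a seaborn heatmap) is left out.

```python
import math

def cosine_similarity_matrix(vectors):
    """Return the square matrix of cosine similarities between all vector pairs."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    sim = []
    for i, a in enumerate(vectors):
        row = []
        for j, b in enumerate(vectors):
            dot = sum(x * y for x, y in zip(a, b))
            denom = norms[i] * norms[j]
            # A zero vector has no direction; report similarity 0.0 for it.
            row.append(dot / denom if denom else 0.0)
        sim.append(row)
    return sim

# Hypothetical 3-dimensional "embeddings" keyed by term.
toy_vectors = {
    "robot": [1.0, 0.0, 1.0],
    "android": [2.0, 0.0, 2.0],
    "romance": [0.0, 3.0, 0.0],
}
labels = list(toy_vectors)
sim = cosine_similarity_matrix([toy_vectors[w] for w in labels])
for label, row in zip(labels, sim):
    print(label, [round(v, 2) for v in row])
```

Because cosine similarity ignores vector length, "robot" and "android" score close to 1 even though one vector is twice as long as the other.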

200 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity

300 Embedding Dimensions

Plot word2vec TSNE

Each token is represented by an embedding vector, whose values are the learned weights (features)

Plot word2vec heatmap for cosine similarity

Documents by Embedding Matrix

100 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

200 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

300 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

Method 2 Experiments

Data wrangling

baseline = tokenization + normalization + stemming + stop words
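
A rough sketch of how stop-word removal and stemming could be layered on the baseline. Everything here is an assumption for illustration: the stop list is a tiny hypothetical one, and `crude_stem` is a simple suffix stripper standing in for a real stemmer (it is not the Porter algorithm). The order of the steps in the notebook is also not known from the outline.

```python
import string

# Hypothetical, deliberately tiny stop list; a real run would use a full list.
STOP_WORDS = {"this", "that", "with", "were", "have", "from", "they", "very"}
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(token):
    """Strip one common suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def method2_tokens(text):
    """Baseline normalization, then stop-word removal, then stemming."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    tokens = [t for t in tokens if len(t) > 3 and t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

print(method2_tokens("They were chasing robots with flashing lights."))
# ['chas', 'robot', 'flash', 'light']
```

Stems such as "chas" are not dictionary words, which is the usual trade-off of stemming compared with lemmatization.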

Functions for pre-processing

Pre-process data set

Document Vectorization using tf-idf

Calculates TFIDF and Saves TFIDF Values for Terms

Terms by Embedding Matrix

100 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity

200 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity

300 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity

Documents by Embedding Matrix

100 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

200 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

300 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

Method 3 Experiments

Data wrangling

baseline = tokenization + normalization + lemmatization + stop words + remove non-alphabetic tokens
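
A sketch of the Method 3 steps under stated assumptions. The stop list and the lookup-table lemmatizer are hypothetical placeholders; the `lemmatize` parameter is there so that a real lemmatizer can be passed in instead. The non-alphabetic filter uses `str.isalpha`, which rejects tokens containing digits.

```python
import string

# Hypothetical stop list and lemma table, kept tiny for illustration.
STOP_WORDS = {"these", "have", "with", "that", "this", "from"}
LEMMA_TABLE = {"movies": "movie", "robots": "robot", "children": "child"}

def lookup_lemma(token):
    """Placeholder lemmatizer: table lookup, falling back to the token itself."""
    return LEMMA_TABLE.get(token, token)

def method3_tokens(text, lemmatize=lookup_lemma):
    """Baseline normalization plus stop-word, non-alphabetic, and lemma steps."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    kept = []
    for tok in text.lower().translate(table).split():
        # Drop short tokens, tokens with non-letters, and stop words.
        if len(tok) <= 3 or not tok.isalpha() or tok in STOP_WORDS:
            continue
        kept.append(lemmatize(tok))
    return kept

print(method3_tokens("These 2019 movies have robots, children and R2D2!"))
# ['movie', 'robot', 'child']
```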

Functions for pre-processing

Pre-process data set

Document Vectorization using tf-idf

Calculates TFIDF and Saves TFIDF Values for Terms

Terms by Embedding Matrix

100 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity

200 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity

300 Embedding Dimensions

Plot word2vec TSNE

Plot word2vec heatmap for cosine similarity

Documents by Embedding Matrix

100 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

200 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix

300 Embedding Dimensions

Plot doc2vec TSNE

Plot Heatmap of Cosine Similarity of Documents using TFIDF matrix